Robust Document Clustering by Exploiting Feature Diversity in Cluster Ensembles

Authors

  • Xavier Sevillano
  • Germán Cobo
  • Francesc Alías
  • Joan Claudi Socoró
Abstract

The performance of unsupervised document classification systems depends on the use of optimal text representations, which are not only difficult to determine in advance but may also vary from one classification problem to another. This work proposes a methodology based on representation diversity and cluster ensembles as a first step toward building robust unsupervised classification systems. Experiments on three binary categorization problems of increasing difficulty show that the proposed method is (i) robust to suboptimal choices of representation dimensionality, and (ii) able to detect constructive interactions between different text representations, reaching consensus categorization rates higher than those achieved by the individual clusterers available.

Keywords: document representation, unsupervised classification, cluster ensembles.
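
The abstract does not say which consensus function is used. As a rough illustration of how partitions obtained from different text representations could be merged, the sketch below uses a generic co-association (evidence accumulation) scheme, which is a swapped-in technique and not necessarily the authors' method. The label vectors, the function names, and the 0.5 threshold are all hypothetical.

```python
from itertools import combinations


def co_association(partitions):
    """Fraction of partitions in which each pair of documents shares a cluster."""
    n = len(partitions[0])
    counts = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1.0
    return [[c / len(partitions) for c in row] for row in counts]


def consensus_labels(partitions, threshold=0.5):
    """Link documents co-clustered in more than `threshold` of the partitions,
    then return the connected components as consensus clusters."""
    n = len(partitions[0])
    assoc = co_association(partitions)
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if assoc[i][j] > threshold:
            parent[find(i)] = find(j)

    # Relabel components as 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


# Hypothetical partitions of six documents, one per text representation.
partitions = [
    [0, 0, 0, 1, 1, 1],  # e.g. clustering on one representation
    [1, 1, 1, 0, 0, 0],  # same grouping with different label names
    [0, 0, 1, 1, 1, 1],  # a weaker clusterer that misplaces document 2
]
print(consensus_labels(partitions))
```

Because the co-association matrix only records whether two documents share a cluster, it is insensitive to how each individual clusterer names its clusters, so partitions from unrelated representations can be combined directly.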

Similar resources

Diversity-Based Weighting Schemes for Clustering Ensembles

Clustering ensembles has been recently recognized as an emerging approach to provide more robust solutions to the data clustering problem. Current methods of clustering ensembles typically fall into instance-based, cluster-based, or hybrid approaches; however, most of such methods fail in discriminating among the various clusterings that participate to the ensemble. In this paper, we address th...

Adaptive Cluster Ensemble Selection

Cluster ensembles generate a large number of different clustering solutions and combine them into a more robust and accurate consensus clustering. On forming the ensembles, the literature has suggested that higher diversity among ensemble members produces higher performance gain. In contrast, some studies also indicated that medium diversity leads to the best performing ensembles. Such contradi...

Selecting Diversifying Heuristics for Cluster Ensembles

Cluster ensembles are deemed to be better than single clustering algorithms for discovering complex or noisy structures in data. Various heuristics for constructing such ensembles have been examined in the literature, e.g., random feature selection, weak clusterers, random projections, etc. Typically, one heuristic is picked at a time to construct the ensemble. To increase diversity of the ense...

A new ensemble clustering method based on fuzzy cmeans clustering while maintaining diversity in ensemble

An ensemble clustering has been considered as one of the research approaches in data mining, pattern recognition, machine learning and artificial intelligence over the last decade. In clustering, the combination first produces several bases clustering, and then, for their aggregation, a function is used to create a final cluster that is as similar as possible to all the cluster bundles. The inp...

The Heterogeneous Cluster Ensemble Method Using Hubness for Clustering Text Documents

We propose a cluster ensemble method to map the corpus documents into the semantic space embedded in Wikipedia and group them using multiple types of feature space. A heterogeneous cluster ensemble is constructed with multiple types of relations, i.e. document-term, document-concept and document-category. A final clustering solution is obtained by exploiting associations between document pairs an...

Journal:
  • Procesamiento del Lenguaje Natural

Volume 37

Publication date: 2006